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Abstract. The couplings in a sparse asymmetric, asynchronous Ising network are 
reconstructed using an exact learning algorithm. Li regularization is used to remove 
the spurious weak connections that would otherwise be found by simply minimizing 
the minus likelihood of a finite data set. In order to see how Li regularization 
works in detail, we perform the calculation in several ways including (1) by iterative 
minimization of a cost function equal to minus the log likelihood of the data plus an 
Li penalty term, and (2) an approximate scheme based on a quadratic expansion of 
the cost function around its minimum. In these schemes, we track how connections are 
pruned as the strength of the Li penalty is increased from zero to large values. The 
performance of the methods for various coupling strengths is quantified using ROC 
curves. 
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1. Introduction 

A crucial step in understanding how a complex network operates is inferring its 
connectivity from observables in a systematic and controlled way. This learning of 
the connections from data is an inverse problem. Recently, with the ongoing growth 
of available data, especially in biological systems, such inverse problems have attracted 
a lot of attention in statistical physics community. Examples of applications include 
the reconstruction of a gene regulation network from gene expression levels [1] and 
identification of the protein-protein interactions from the correlations between amino 
acids [2]. One proxy for such a problem is the inverse Ising model, where the parameters 
of the model (fields and interactions) are inferred from observed spin history. 

There has been a long history of inferring Gibbs equilibrium models, such as the 
equilibrium Ising model, where the fields and couplings are inferred from the measured 
means and correlations [3l |3l El [6] . The methods developed for learning the connections 
in these models as originally formulated, do not assume any prior belief about the 
network architecture and they only use the data to decide about that. Recently, though 
it has been shown that the connections can be inferred much more efficiently when a 
sparse prior, specifically the Li regularizer, is taken into account [71 [8]. However, our 
theoretical understanding of how Li regularization works is limited. 

In many practical applications, equilibrium models would be of limited use. For 
instance, most biological systems operate in out-of-equilibrium regimes. Consequently, 
equilibrium models are not usually good candidates to infer the interactions and may 
fall short even as generative models for describing the statistics of the data [9] . Several 
recent studies have thus moved to kinetic models, prescribing exact and approximate 
learnings for inferring the connections in non-equilibrium models [IDl El [12]. However 
this body of work has not yet exploited the potential power of Li regularization in 
inferring the connections. 

In this paper, focusing on the asynchronously updated Ising model, we will describe 
how Li regularization helps in inferring the connections in a non-equilibrium model. 
We try to shed light on the mechanics by which the regularization works through 
developing approximate ways of performing Li. We study how the regularization shrink 
the connections gradually with the increase of regularization parameter. 

The paper is organized as follows: The dynamics and the underlying network are 
described in section [21 an Li-regularized learning rule for an asynchronously updated 
kinetic Ising model is described in section [3l approximate learning algorithms, based 
on an expansion of the cost function of section [HI are derived in section [H and the 
performance of the learning rules is studied in section [5l The effects of different coupling 
strengths are explored in section [61 A discussion is given in section [71 
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2. Glauber dynamics and network 

We consider a kinetic Ising model endowed with Glauber dynamics [13j. Glauber 
dynamics describes the evolution of the joint probability of the spin states 
^(5*1, 5*2, Sn', t) in time t, following the master equation 

dp{s; t) 
Jt 

where, 



^Ui{-Si)p{si, -Si, ...,SN;t) -J2^iisi)pis;t). (1) 



1 + exp[2si{t)Hi{t)\ 2 

is the probability for spin i to change its state from Si{t) to —Si(t) during the time 
interval dt. Here, we choose time units so that 70 = 1. The quantity Hi(t) = 
hi + J2j JijSj{t) is the instantaneous field acting on spin i. The external field hi can 
be dependent on time, but for the sake of simplicity we focus on the stationary case, 
i.e., time-independent hi, here. 

One way to implement the Glauber dynamics is as follows: Make a discretization of 
the evolution process with very small time steps 6t <^1/N. At each step, every spin is 
selected for updating with probability 5t. For 5t ^ almost certainly only one spin 
at a time will be updated. The next value of the spin selected for updating is chosen 
according to 

p{s.{t + 5t)\sm) = '""^afsh^^i^^^^^^ = i[l + ..(t + 5t)tanhff,(t)]. (2) 

Note that the updated spin might not change its value; an update is not necessarily a 
flip. In this paper, we will take the dynamics to be defined in this doubly stochastic way 
and assume that the data accessible to us include both the times at which every spin is 
selected for updating (determined by an independent Poisson processes for each spin) 
and the result of those updates (whose outcomes are given by ([2]), i.e., the spin history). 
The problem may also be treated by other algorithms that only assume knowledge of 
the spin history (not of all the update times); these are discussed in other work [H], but 
we do not consider them here. In our computations, in order not to waste lots of time 
not updating any spins, we have, at each time step, chosen exactly one spin at random 
for updating. For finite this is not exactly the dynamics described above, but we 
do not see any difference when we compare the results of our computations with those 
done following the correct dynamics exactly. 

We study a diluted binary asymmetric Sherrington-Kirkpatrick (SK) model with 
these dynamics. For the original SK model, the pairwise interactions Jjj between spins 
i and j were i.i.d. Gaussian variables (except Jij = Jji) with variance g'^/N and mean 
0. In the model we study here, the network is diluted, Jij is independent of Jji, and the 
interactions vary only in sign, not in magnitude: Each coupling has the distribution 
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where c is the average in-degree (and out-degree). We are interested in sparse networks, 
i.e., c <^ N. In our computations, we use = 40 and c = 5. Furthermore, as mentioned 
above, we model asymmetrically coupled spins, taking each Jij independent of Jji. This 
model can have a stationary distribution (and does for the parameters we use here), but 
it is not of Gibbs-Boltzmann form, and no simple expression for it is known. 

3. Exact learning 

As described above, we suppose we know the full history of the system - both the 
{sj(t)}, with 1 < i < N and 1 < t < L, where L is the data length, and the update 
times {Ti}. We can reconstruct the couplings Jij and external fields hi by performing 
the gradient descent on the negative log-likelihood of this history, which is given by 

- = - EE + ^t)Hi{Ti) - log2coshff,(r,)] . (4) 

i Ti 

We can minimize the log-likelihood by simple gradient descent with a learning rate rj: 

dC 

^•^ij = = vYli^iin + ^t) - tanh Hi{Ti)]sj{Ti). (5) 

This equation includes the learning rule for the external field hi under the convention 
Jio = hi, So(t) = 1. It has the same form as that for a synchronous model, except that 
changes for spin i are made only at times r^. 

For finite L, this procedure will in general produce a densely-connected network. 
To sparsify it, we add a simple regularization term that penalizes dense connectivity in 
a controllable fashion. We then minimize a cost function 

E = -Co + Aj2\-M- (6) 

ij 

where the first term is the negative log-likelihood and the second term is the Li norm. 
There are several efficient methods have been used to minimize the cost function ([6]), 
e.g., the interior-point method [T^ [TB] . However, in order to see how Li regularization 
works in detail, we study a simple gradient descent algorithm here. Gradient descent 
on this cost function leads to an additional term in the learning rule for couplings: 

SJij = Vj |E + " tanh iJi(ri)] Sj{Ti) - Asgn(Jij)| . (7) 

The log-likelihood function Co is smooth and convex as a function of the Jij and hi, 
so the cost function is concave except on the hyperplanes where any Jij = 0. This leads 
to complications in the minimization whenever a minimum of E is at Jij = 0: We deal 
with this problem by setting Jij = whenever the change (JTj) would cause Jij to change 
sign. Then, if the minimum of E truly lies at this Jij = 0, the estimated Jij will oscillate 
between zero and a small nonzero value (using sgn(O) = 0). However, the size of these 
oscillations is proportional to the learning rate rj, so a sufficiently small rj ensures that 
these couplings can be pruned by a simple rounding procedure, with negligible chance 
of removing coupling that are not truly zero at the minimum. In the case that Jij is 
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not zero at the minimum, its estimated value will continue to change and it will move 
toward its optimal value after the step where it was set to zero. 

Another way to deal with the non-differentiability of the cost function ([6]) is to use 
A/iX]i,j logcosh(Jjj//x) the penalty term and take the limit /i — )■ 0. This term leads 
to the replacement of the Asgn(Jjj) by A tanh( Jjj//i). For any non-zero /i, this modified 
cost function is totally convex. We checked some of our computations by doing the 
regularization this way. No difference between these results and those done as described 
above was found. 



4. An approximate learning scheme 

We can get some insight into the dynamics of the learning with regularization by 
expanding the cost function to second order around its minimum J° when A = 0. 
Up to a constant, we have 

ijk ij 



where Vij = Jij — J^j, T = L/N is the number of updates per spin, A = A/T, and 

C« = ^E(l - tanh' H^{T,))6s,{n)6s,{T,). (9) 



Since the quantities in the sum in iQ are insensitive to whether spin i is updated, the 
average over updates may safely be replace by an average over all times, 

Cfi = ((1 - te^nh^ H^{t))6s,{t)6sk{t))t (10) 

the Fisher information matrix for spin i, which is a more robust quantity. 
Minimizing ([8]), we get, to first order in A, 

Y^C^v^k = -Asgn(J° +^;,,) ^ -Asgn(J°). (11) 

k 

Solving this equation for Vij,we obtain: 

-^EM",'sgn(4)- (12) 



Vij 



k 

This equation shows how the regularization term shrinks the magnitudes of the 
couplings. 

In the weak coupling limit (small g or, equivalently, high temperature), C*-*^ = 

L Ijk 

6jk, SO the Jij are just shrunk in magnitude proportional to A until they reach zero and 
are pruned. This is a trivial kind of regularization: We know that the couplings that 
survive the pruning procedure the longest are simply the ones with the biggest initial 
absolute values. In this case, there is no need to go through the elaborate learning- 
with- regularization procedure of ([7]). However, at larger coupling this is not the case. 
Some Jij will be shrunk more rapidly than others, depending on the size and signs of 
the terms in the sum in ffT2l). 
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Based on the quadratic expansion ([H]), we can carry out the pruning in an 
approximate alternative fashion, as follows: Starting from Jfj and a small value of A, we 
calculate the shifts Vij by ( fT2l) and remove any Jij that would go though zero. Starting 
from the resulting new Jijs (some of them now equal to zero), increase A, recalculate 
the Fisher information matrix and calculate new shifts in the parameter values. Again 
remove any couplings that change sign, and continue until the desired degree of pruning 
has been achieved. This amounts to numerical integration of the differential equation, 
describing a kind of dynamics of regularization under increasing A. 



Note that this procedure requires only equal-time average quantities, unlike the full 
computation following the learning rule ([7]). 

If one knows a priori what value of A to use, it is probably not an advantage to 
use this algorithm. One can simply do the full computation once, at that value, while 
with this approximate algorithm we have to simulating the model to estimate the Fisher 
matrices at all the intermediate As in integrating ( !T3|) . On the other hand, one may 
not know the optimal A. It then becomes necessary to explore the regularized model 
over some wide range of A. In this case, the approximate algorithm will have a speed 
advantage, because it only requires a learning loop at the initial A (zero, in the case 
described here). 

5. Results 

We consider the problem of identifying the positive and negative couplings in the 
network, i.e., correctly classifying every potential bond as +, — or 0. Consider first 
the couplings Jijs found with no regularization, i.e., A = 0. For given g, c and A^, for 
very large T the inferred Jij will be very close to the true ones. A histogram of their 
values will have three narrow peaks around and ±(7/-\/c, and it will be trivial to identify 
the true nonzero couplings and their signs (figure la,b). In the opposite limit (small 
T), the data are not sufficient to estimate the couplings well. The histogram will be 
unimodal, and it will be more or less hopeless to solve the problem, even with the help of 
Li regularization (figure lc,d). The interesting case is that of intermediate data length, 
for which the partial histograms from the zero and nonzero-J classes overlap, but the 
separations between their means are not much smaller than their widths (figure le,f). 
We would also like to avoid the trivial weak-coupling case mentioned above, so in the 
following results we report here we take g = 1/V2. For this case, a T of 200 realizes the 
interesting intermediate-data-length case. 




(13) 
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Figure 1. Distribution of the inferred couplings without Li regularization, g = l/y/2 
for various data lengths. Top: T — 2000 updates/spin. Middle: T ~ 50. Bottom: 
T = 200. In each row, the left panel shows a histogram of the Jy obtained, and the 
right panel shows these sorted according to whether the bond was present (green) or 
absent (red) in the network that generated the data. 



Based on Js inferred with Aq = as shown in figure [T^ and f, four pruning 
methods were employed. Figure [2] shows how the Js inferred by each method vary 
as the regularization coefficient A is increased. Here, we only show positive Jqs; graphs 
of the negative ones would look like the ones shown, refiected through the horizontal 
axis. Bonds actually present in the model (a realization of ([3])) are plotted in black and 
bonds which are absent in red. 

Figure [2^ shows the Js inferred using exact learning with Li regularization ([7]). It 
is apparent that the pruning process for the case shown here is not trivial in the way it 
would be in the weak-coupling limit: Some true (black) bonds, for which rather small 
values were inferred at A = because of insufficient data, are "rescued" (they fall off 
more slowly with A than red ones with nearly the same initial inferred Js), and some 
spurious (red) bonds with high inferred values at A = are driven to zero faster than 
black ones with the same initial inferred Js. Thus, the red and black lines tend to be 
separated, and one can do the pruning almost correctly just by turning A up until the 
desired number of bonds have been removed. 

Figure [2]d shows the inferred Js using the quadratic expansion ([8]) in the fashion 
described at the end of Sec. HI We call this "approximation 1" . The qualitative features 
of figure [2^ are apparently reproduced in this approximation. 

Figure Eb shows the result when off-diagonal elements of C-'HA) are ignored 

L J jk 

in f |T3l) . We refer to this procedure as "approximation 2". The separation of red and 
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Figure 2. Inferred couplings as functions of regularization coefficient A for four 
methods: (a) full Li regularization using ([7]), (b) integration of (fT3)) . (c) integration of 
(|13p with diagonal approximation of the inverse Fisher matrix, (d) linear extrapolation 
in A of the curves in (c). Black lines represent bonds actually presents, while red lines 
represent ones equal to zero in the network used to generate the data. We show equal 
number of red and black ones. 



black curves is not as good in this case. We also tried making a diagonal approximation 
of the Fisher matrix itself, rather than its inverse: Cj]j by C'f^djk- However, this gave 
much worse results (not shown) than making the diagonal approximation on the inverse 
Fisher matrix. 

In figure |2t, it is evident that the slopes of the Jij(A) curves vary rather slowly 
with A. Therefore, we also tried a linear extrapolation based on the slopes of the curves 
in figure at A = 0. We denote this method as "Approximation 3". To the extent 
that this simple procedure works, one can identify the nonzero bonds with very little 
computation: One needs only to do the learning at A = (to get the Jjj(A)) and 
calculate the Fisher matrices (to get the dJij/dX). Figure [2li shows the result of this 
minimal algorithm. 

For Approximation 3, the inferred Js that have been shrunk to zeros have no chance 
to be rescued again. But for the other three approaches, the inferred Js for the positive 
ones (as shown in black lines) have that chance to be back again with increasing of A. 
However, in the results presented in figure [21 we haven't observe such phenomena. 

One could also try similar linear extrapolation based on the initial slopes of the 
upper panels of figure |2j However, these curves show significant curvature for A < 30 or 
so, so the initial slopes are not good guides to the ultimate fate of the bonds at large A, 
and we do not present any results for these methods. 
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In what follows, we quantify the performances of these four pruning algorithms. 
For the three classes of bonds in the actual network, — , + and 0, we can compute the 
empirical classification errors. These errors can be either false positives (FP) (identifying 
a bond which is really absent as present), or false negatives (FN) (identifying a bond 
which is actually present as absent). In addition, a + bond could be misclassified as — 
or vice versa, but this does not happen for the data length we are studying here. 

At A = 0, where in general all bonds will be estimated to have nonzero values, there 
will be no FNs and N{N — c) FPs. In the other limit A oo, all bonds will be removed, 
so there will be cN FNs and no FPs. The empirical numbers of FPs and FNs versus A 
are plotted in the left panels of figure [3l The total misclassification error, i.e., the sum 
of the FPs and FNs (shown in the right panel of figure [3]) has a minimum at A ~ 33.5 for 
full Li regularization. For Approximation 1 we find a minimum at A ~ 31.5, while for 
Approximation 2 the minimum is at A ~ 30, and for Approximation 3 it is at A ~ 24. 
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Figure 3. Dependence of classification errors on A. Left column: Number of 
misclassified — , +, and (absent) bonds. Numbers of false negatives for — s arc shown 
in green, for +s in red, and false positives for zero-bonds in blue. Right column: the 
sum of false negatives and false positives versus A. Because the Js are symmetrically 
distributed, red and green curves almost coincide, with mostly only the green ones 
visible here. From top to bottom: full Li regularization and Approximations 1, 2, 3, 
respectively. 

In applications, FNs and FPs may not have the same cost associated with them: 
it may be appropriate to weight the blue and green curves in the left-hand panels of 
figure |3] differently. To compare algorithms in a more general way that is not specific to 
a particular relative weighting, we calculate Receiver Operating Characteristic (ROC) 
curves for them. For a given A, the false positive rate (FPR) is defined as the number 
of FPs divided by the actual number of zero bonds, and the false negative rate (FNR) 
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is defined as the number of FNs divided by tlie number of actual number of non- zero 
bonds. A true positive (TP) is the identification of a bond which is actually present 
as present, and the true positive rate (TPR) is the number of TPs normalized by the 
actual number of bonds present. It is equal to 1 — FNR. The ROC curve is a plot of 
TPR versus FPR. Each value of A gives one point on the curve. In figure lU we plot the 
ROC curves for all of our methods. We also measure the performance of the different 
methods quantitatively by defining an error measure, e: 

e = 1 — area under ROC curve. (14) 

The values of es for full Li and Approximations 1, 2 and 3 are 0.03, 0.06, 0.08, 0.09 
respectively. Thus, full Li algorithm performs best, followed by approximation 1. 
Approximation 2 works worse than them and it is only little better than approximation 
3 for most values of A, as can be seen in figure HI 

To establish a baseline for the goodness of our methods, we also performed a simple 
pruning procedure that does not require any Li regularization calculation. For a given 
cut value J, we identify the bonds whose Js lie in the range [—J, J] as absent and those 
outside that interval as present. The green Js in figure [1}: which lie within the interval 
are FNs and the red ones outside the interval are FPs. Varying J, we obtain an ROC 
curve. We refer to this procedure as "JO-cut". The curve with light blue squares in 
figure m is for it. The curve nearly coincides with that for Approximation 3. Its e is 
0.09, the same as that of Approximation 3. Thus, this trivial method works as well as 
Approximation 3. 
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Figure 4. ROC curves for full Li regularization, Approximations 1, 2, 3, and the 
JO-cut method are shown in red, green, blue, pink and light blue, respectively. 

6. Effects of coupling strength g on Li regularization 

The above results were all obtained for g = 1/ a/2. We are also interested in how the 
different regularization methods behave for other g values. As we mentioned in section 
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mthat in the weak couplings limit g 0, the inverse of the Fisher information matrices 
for different is are the same, equal to identity, thus the regularizations by Approximation 
methods will be equal to that of the trivial JO-cut. 

We repeat the calculation ROC curve for two other gs: 1 and 1/2. To get problem 
of the same level of difficulty, we first calculate the ROC curve by JO-cut method and 
make sure that the area under the curves are the same for each g. A set of data lengths 
L = NT, for which the same areas under the curves can be obtained are found to be 
11608, 8862 and 6730 for g = 1/2, l/\/2 and 1 respectively. No bigger g values are 
tested because they need shorter data length to get the same area, however, short data 
length increases the difficulties of the learning rule. The ROC curves for all three gs are 
shown by the dashed lines in both figure |5|^a) and (b). The e value are all around 0.94 
for them. 

With this stating point, we calculate the ROC curves for full Li regularization and 
Approximation 1 for all three gs. As noted in figure HJ the regularization by full Li and 
Approximation 1 have obviously better performances compared with that of the trivial 
JO-cut method, thus we next focus on this two methods to test whether regularization 
helps more at larger g. In figure [5]^a), the solid lines represent the ROC curves by full 
Li regularization. The e are 0.033 for g = 1/2, 0.023 for g = 1/V2 and 0.012 for g = 1. 
Similarly, in figure |5]^b), the solid lines are for ROC curves by Approximation 1, with 
area 0.047, 0.039 and 0.035 for g = 1/2, 1/^/2 and 1 respectively. As shown by the solid 
lines in both figure |5]|^a) and (b), we can see that with increasing of g, the regularization 
methods work better. Both of them perform better than the trivial method, which is 
accordant with the results shown in figure HI 
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Figure 5. ROC curves for full Li regularization (left, solid lines) and Approximations 
1 (right, solid lines) with S = ^ respectively. The green lines for g = i, red 

for g = and black for g = 1- Corresponding dashed lines are for JO-cut method of 
these gs. 
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7. Discussion 

We have studied the reconstruction of sparse asynchronously updated kinetic Ising 
networks. With finite data length, simple maximization of the log likelihood of the 
system history will infer nonzero values to many bonds that are actually not present. For 
large data length, this is generally not a problem, since the inferred bond distribution 
will consist of well-separated peaks. The ones with the smallest absolute values can 
then safely be identified as spurious and removed "by hand" . However, for smaller data 
lengths, these peaks can overlap strongly, and nontrivial methods are required to make 
an optimal pruning of the inferred coupling set. Here we used Li regularization to do 
this, minimizing a cost function that includes the Li-norm of the parameter vector as 
a penalty term. We performed this minimization in four ways, one exact and the other 
three involving various degrees of approximation. 

Calculations on a model network at intermediate coupling strength revealed that 
the exact Li regularization classified the bonds significantly better than a naive method 
based on retaining the strongest bonds. Our Approximation 1 was somewhat worse than 
the exact algorithm, but still significantly better than the naive method. Our other two 
approximations, obtained by successive simplifications of Approximation 1, however, 
did not perform measurably better than the naive way, as measured by the areas under 
their ROC curves. These conclusions are general to various coupling strength we used. 
The regularizations helps more with stronger coupling strengths. 

This work is the first that we know of that takes a detailed look at how 
Liregularization works in the non-equilibrium model, by studying how bonds are 
removed successively as the regularization parameter A is increased. Some insight into 
how this happens was made possible by studying the quadratic expansion of the cost 
function about its minimum, which also led to our relatively successful Approximation 
1. The process would have been more transparent if we could have made further 
simplifying approximations, as we did for Approximation 2, where we neglected off- 
diagonal elements of the inverse Fisher matrices. The fact that this approximation 
performed rather poorly (while Approximation 1 did quite well) indicates that the off- 
diagonal terms in ( IT3|) are necessary, and we lack generic insight about them. 

We performed our analysis here on a rather simple model network. However, we 
expect that our methods will be useful in analyzing date from a wide variety of biological, 
financial, and other complex systems with sparse structure. 
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